Skip to content

Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots#77

Draft
Copilot wants to merge 7 commits intosimn-developmentfrom
copilot/prepare-low-level-profiling-tools
Draft

Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots#77
Copilot wants to merge 7 commits intosimn-developmentfrom
copilot/prepare-low-level-profiling-tools

Conversation

Copy link
Copy Markdown

Copilot AI commented Mar 15, 2026

Profiling setup and analysis for the Haxe compiler eval path and HXB roundtrip, identifying GC pressure, type-traversal hotspots, and zip I/O as the primary cost centres. Includes a concrete performance fix for HXB zip I/O, a replacement of Domainslib with a custom WorkerPool, a comprehensive profiling guide, and a corrected analysis of polymorphic comparison sites verified via OCaml assembly inspection.

Infrastructure changes

  • src/dune: adds a profile build environment with (ocamlopt_flags (:standard -O2 -g -unbox-closures)) so a DWARF-instrumented binary can be produced without touching the release build:
    dune build --profile profile src/haxe.exe
    
    The resulting binary is accepted by perf record --call-graph dwarf and resolves OCaml symbols correctly.

Code changes

  • src/compiler/generate.ml: switched HXB zip compression from level 6 (Deflated) to level 0 (Stored). This eliminates all zlib compression/decompression overhead:

    • Read-phase hxblib get bytes: 0.090 s → 0.021 s (4.3× faster)
    • Archive size: 3.6 MB → 6.5 MB (1.8× larger — favorable tradeoff)
  • src/context/parallel.ml: replaced Domainslib with a custom WorkerPool using Domain.spawn + Condition.wait. Workers sleep (zero CPU) between parallel_for calls instead of spin-waiting. Key design points:

    • Workers are spawned once at pool creation and block on Condition.wait when idle
    • Work is distributed via contiguous chunk partitioning across N domains (main + N-1 workers)
    • Nested parallel_for calls (e.g. analyzer iterating types → fields) detected via atomic busy flag and fall back to sequential, preventing state corruption
    • Exception propagation with backtrace from worker domains
    • ManagedPool retained as a thin lazy wrapper matching the old lifecycle
  • src/compiler/server/serverCompilationContext.ml: updated ManagedPool.create to use the new argument-free constructor.

  • src/dune and haxe.opam: removed domainslib dependency.

Documentation

  • extra/PROFILING.md: comprehensive profiling guide covering how to use perf, olly gc-stats, and --times flags, with full results, recommendations, and documentation of the WorkerPool architecture.

How to profile

# lower perf paranoia (requires sudo)
sudo sysctl -w kernel.perf_event_paranoid=-1

# record eval unit tests
perf record -F 999 -g --call-graph dwarf -o eval.perf \
  haxe --cwd tests/unit compile-macro.hxml

# built-in timer breakdown (most informative for day-to-day work)
haxe --cwd tests/unit compile-macro.hxml \
  --times -D times.eval -D times.filter=2 -D times.analyzer=2

# OCaml 5 GC profiling
olly gc-stats -- haxe --cwd tests/unit compile-macro.hxml

Findings

Eval unit tests (--times + perf + olly, ~2.7 s)

Source Cost
GC (do_some_marking + alloc/promote/sweep) ~26 % of CPU (olly confirmed)
Macro execution (utest ancestorHasInitializeUtest alone) 28 %
Typing 26 %
Filters (handle_abstract_casts, fix_return_dynamic_from_void_function) 13 %
Parsing 12 %
Analyzer (fusion dominates) 7 %
compare_val (polymorphic =/compare) 3.1 %

compare_val analysis (3.1 % of perf time) — verified via assembly inspection

OCaml's native compiler specializes = for types with all-constant constructors and for comparisons against literal constant constructors. Verified by compiling test files with ocamlfind ocamlopt -S and inspecting the generated assembly.

Truly polymorphic (calls caml_equal):

  • typeloadCheck.ml:187field_kind = field_kind (has Var of var_kind)
  • optimizerTexpr.ml:189tconstant = tconstant (has TInt of int32, TString of string, etc.)
  • fields.ml:141,194e.eexpr = TConst TSuper (structured literal, but shallow comparison)
  • Path comparisons (~10 sites) — string list * string
  • nullSafety.ml — polymorphic Hashtbl with safety_subject keys (also contributes to caml_hash 2.4%)

NOT polymorphic (specialized by OCaml, corrected from initial analysis):

  • typeloadCheck.ml:66m1 = MethDynamic: method_kind all-constant constructors → direct cmpq
  • typeloadCheck.ml:432mkind m1 = mkind m2: projects to int → direct cmpq
  • fields.ml:41!(a.a_status) = Const: literal constant constructor → direct cmpq

HXB roundtrip

  • Write phase: +0.38 s over normal eval (13 %); HxbWriter.write_type_instance and Zlib.update_crc visible in perf.
  • Read phase: ~0.85 s — 3× faster than a full compile. After switching to Stored compression, hxblib get bytes dropped from 0.090 s to 0.021 s.

GC overhead (olly gc-stats)

  • Eval: 26 % of CPU, 30.7 % on main domain
  • Roundtrip: 43 % of CPU (write phase creates many temporary buffers)
  • P99 GC latency: 3.0 ms, max: 6.0 ms

WorkerPool benchmark (4-core CI runner)

Benchmark Domainslib WorkerPool Delta
Eval unit tests (median of 5) 2762 ms 2809 ms +1.7 % (noise)
JVM compilation (median of 3) 1062 ms 1070 ms +0.8 % (noise)

Throughput is within measurement noise. The primary benefit is eliminating idle CPU consumption: Domainslib workers spin-wait even when no work is available, whereas WorkerPool workers block on Condition.wait and consume zero CPU between parallel sections.

Remaining improvement opportunities

  1. compare_val — replace polymorphic (=)/compare with typed equality in the confirmed polymorphic sites (field_kind, tconstant, path comparisons). Switch nullSafety.ml to functorized hash tables with custom hash/equal for safety_subject.

💬 Send tasks to Copilot coding agent from Slack and Teams to turn conversations into code. Copilot posts an update in your thread when it's finished.

…-times data

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Copilot AI changed the title [WIP] Add setup for low-level profiling of compiler Add perf profiling infrastructure and document compiler hotspots Mar 15, 2026
Copilot AI requested a review from Simn March 15, 2026 09:01
@Simn
Copy link
Copy Markdown
Owner

Simn commented Mar 15, 2026

  • GC pressure — dominant cost. Allocations in Texpr.map_expr, TFunctions.follow, and Stdlib.List.map are the primary drivers. Profile with memtrace to locate top allocators; consider in-place traversal patterns.

@copilot This is usually fine in OCaml due to its generational GC, so you might be jumping to conclusions too quickly here. But memtrace sounds like a good plan, go ahead and do that!

  • compare_val (3.3 %) — replace polymorphic (=)/compare with typed equality in hot typing/filter loops.

Indeed, please identify the places where this occurs.

  • HXB zip decompression — storing entries as Stored (uncompressed) eliminates decompression cost entirely at the expense of archive size; alternatively, pre-decompress and cache per-module bytes after first access.

Agreed, let's store without compression.

  • HxbWriter.write_type_instance — more aggressive type-reference deduplication (interned index) would shrink the archive and reduce write time.

I'm not sure what you mean here, please implement it so I can take a look!

  • Domainslib.Multi_channel spin-wait — worker domains spin even for small jobs; add a module-count threshold before spawning domains.

I don't have a good solution for this because the Domainslib API only allows fixed domain pools (I think). Can you think of a better approach for managing domains?


Also, please be aware that your opam installation got stuck waiting for a y input, so make sure OPAMYES=1 is set when running it.

…ith perf/olly/memtrace findings

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Copy link
Copy Markdown
Author

Copilot AI commented Mar 15, 2026

  • GC pressure — dominant cost. Allocations in Texpr.map_expr, TFunctions.follow, and Stdlib.List.map are the primary drivers. Profile with memtrace to locate top allocators; consider i...

Addressed in 930c728:

GC / memtrace: memtrace doesn't work on OCaml 5 multicore (Gc.Memprof is unsupported). I used olly gc-stats (OCaml 5's runtime_events_tools) instead — GC overhead is 26% of CPU for eval, 43% for roundtrip. Full results in extra/PROFILING.md.

compare_val (3.1%): Identified via perf --symbol-filter. Called through both caml_compare (2%) and caml_equal (1%). Key sites:

  • typeloadCheck.ml:187| a, b when a = b on field_kind
  • typeloadCheck.ml:66,432MethDynamic / method kind comparisons
  • fields.ml:41!(a.a_status) = Const on anon_status
  • optimizerTexpr.ml:189a = b on constants
  • nullSafety.ml — 26 Hashtbl.find/mem calls using polymorphic hash+equality

Full list in extra/PROFILING.md §2.

HXB compression: Changed to Stored (level 0). Read-phase get bytes dropped from 0.090s → 0.021s (4.3×). Archive grows 3.6 MB → 6.5 MB.

write_type_instance: Already deduplicated within expressions via write_texpr_type_instance which serializes to bytes then interns via StringPool. At 0.38% of perf time, the gains from further deduplication are minimal — left as-is with explanation in the guide.

Domainslib: The ManagedPool already supports lazy acquire/release. The issue is that Domainslib.Task.setup_pool creates a fixed set of domains that spin-wait. Two approaches: (1) use Domain.spawn directly for short parallel sections like HXB export, replacing ManagedPool for those paths, or (2) add a module-count threshold (e.g. >50 modules) before acquiring the pool. Approach (1) eliminates spin-wait entirely but loses automatic work-stealing; approach (2) is simpler. Documented in extra/PROFILING.md §5.

Also noted re: OPAMYES=1 — will use that going forward.

Copilot AI changed the title Add perf profiling infrastructure and document compiler hotspots Add perf profiling infrastructure, switch HXB to uncompressed storage, and document compiler hotspots Mar 15, 2026
Copilot finished work on behalf of Simn March 15, 2026 09:21
@Simn
Copy link
Copy Markdown
Owner

Simn commented Mar 15, 2026

@copilot

  • Please investigate the = situation because I don't think you're right about that. If the types are known at compile-time in OCaml, this should not lead to a polymorphic compare. Do some research and inspect the generated code to confirm.
  • Could you suggest a framework for using from Domainslib to Domain.spawn? I don't really have a vision for how to control the number of domains and how to actually interact with such a setup. We don't want to constantly set up and tear down domains in such cases, so this has to be managed somehow.

Copilot AI and others added 2 commits March 15, 2026 09:49
…y inspection; propose Domain.spawn framework

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Copy link
Copy Markdown
Author

Copilot AI commented Mar 15, 2026

@copilot

  • Please investigate the = situation because I don't think you're right about that. If the types are known at compile-time in OCaml, this should not lead to a polymorphic...

Addressed in 7ed90ce and c8e7434.

= investigation: You were right — I verified by compiling test files with ocamlfind ocamlopt -S and inspecting the generated assembly. OCaml's native compiler specializes = in two cases:

  1. Types where all constructors are constant (no args) — e.g. method_kind → emits cmpq
  2. Comparing a variable against a literal constant constructor, even if the type has structured variants — e.g. anon_status = Const → emits cmpq $3, %rax

The sites I previously listed that are NOT polymorphic (corrected):

  • typeloadCheck.ml:66m1 = MethDynamic: method_kind all-constant → direct compare
  • typeloadCheck.ml:432mkind m1 = mkind m2: projects to int first → direct compare
  • fields.ml:41!(a.a_status) = Const: literal constant ctor → direct compare

Still truly polymorphic (calls caml_equal):

  • typeloadCheck.ml:187field_kind = field_kind (has Var of var_kind)
  • optimizerTexpr.ml:189tconstant = tconstant (has TInt, TString, etc.)
  • fields.ml:141,194e.eexpr = TConst TSuper (structured literal of huge variant type, but shallow)
  • Path comparisons — string list * string
  • nullSafety.ml — polymorphic Hashtbl with safety_subject keys (also contributes to the caml_hash 2.4%)

Full corrected analysis with assembly evidence in extra/PROFILING.md §2.

Domain.spawn framework: Added a concrete proposal in extra/PROFILING.md §5. The idea:

  • Replace the persistent Domainslib pool with Domain.spawn + Domain.join for each parallel section
  • All compiler parallel sections are simple data-parallel loops (array of modules/types) — no work-stealing or nested tasks needed
  • parallel_for ~num_domains length f partitions work into contiguous chunks, spawns N-1 domains, runs one chunk on the calling domain, then joins all
  • Add a min_parallel_items threshold (e.g. 32) to skip spawning for small arrays
  • ManagedPool.t is replaced by a simple num_domains : int in the server context
  • Zero idle CPU between parallel sections (no spin-wait)

Copilot finished work on behalf of Simn March 15, 2026 09:51
@Simn
Copy link
Copy Markdown
Owner

Simn commented Mar 15, 2026

@copilot I'm very curious about that parallel redesign. Go ahead and implement it, and make sure to measure the difference accurately. If this leads to the removal of Domainslib as a dependency, that'd be a bonus.

Copilot AI and others added 2 commits March 15, 2026 10:30
…ro idle CPU

Replaces the Domainslib dependency with a custom WorkerPool that uses
Domain.spawn for workers that block on Condition.wait between calls.

Key design:
- Workers sleep (zero CPU) when no parallel_for is active
- Nested parallel_for calls detected via atomic busy flag, fall back
  to sequential (matches Domainslib's effective behavior)
- Exception propagation with backtrace from worker domains
- ManagedPool retained as lazy wrapper with same lifecycle

Benchmark (4-core CI runner):
  Eval unit tests: 2762ms (Domainslib) → 2809ms (WorkerPool) [+1.7%, noise]
  JVM compilation: 1062ms (Domainslib) → 1070ms (WorkerPool) [+0.8%, noise]

Removes domainslib from src/dune and haxe.opam.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
…ughput

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Copy link
Copy Markdown
Author

Copilot AI commented Mar 15, 2026

@copilot I'm very curious about that parallel redesign. Go ahead and implement it, and make sure to measure the difference accurately. If this leads to the removal of Domainslib as a dependency, that'...

Implemented in d9dd515. domainslib is fully removed from src/dune and haxe.opam.

Design: WorkerPool in src/context/parallel.ml spawns N-1 worker domains at pool creation. Workers block on Condition.wait between parallel_for calls (zero CPU). Work is distributed via contiguous chunk partitioning. A key detail: nested parallel_for calls (e.g. analyzer iterating types → iterating fields per type) are detected via an atomic busy flag and fall back to sequential, which prevents state corruption.

Benchmarks (4-core CI, median):

Domainslib WorkerPool
Eval 2762 ms 2809 ms (+1.7%)
JVM 1062 ms 1070 ms (+0.8%)

Within noise. The throughput is equivalent; the benefit is zero idle CPU (workers sleep instead of spin-waiting).

Copilot AI changed the title Add perf profiling infrastructure, switch HXB to uncompressed storage, and document compiler hotspots Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots Mar 15, 2026
Copilot finished work on behalf of Simn March 15, 2026 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants